Assembling the Kazakh Language Corpus
Abstract
This paper presents the Kazakh Language Corpus (KLC), one of the first attempts within a local research community to assemble a Kazakh corpus. KLC is designed as a large-scale corpus containing over 135 million words and covering five stylistic genres: literary, publicistic, official, scientific, and informal. Alongside its primary part, KLC comprises (i) an annotated sub-corpus of segmented documents encoded in the eXtensible Markup Language (XML), which marks the complete morphological, syntactic, and structural characteristics of the texts, and (ii) a sub-corpus of annotated speech data. KLC has a web-based corpus management system that helps users navigate the data and retrieve the information they need. KLC is also open to contributors who are willing to make suggestions, donate texts, and help annotate existing materials.
Similar resources
Religious and Educational Worldview in the Works of Poets of the Region of SYR and Their Educational Value
In the Kazakh and world literature an important place is occupied by the work of poets, whose works are imbued with religious and educational ideas. Their work is one of the main directions in the history of the formation and development of Kazakh literature. Works of religious content teach humanity and morality through the concepts of religion. The article is devoted to the study religious an...
Kazakh Segmentation System of Inflectional Affixes
This paper focuses on the automatic segmentation of inflectional affixes of the Kazakh Language (KL) on the basis of studying the corpus of KL. Kazakh is an agglutinative language with word structures formed by productive affixation of derivational and inflectional suffixes to stems. Based on the analysis of the configuration of inflectional affixes, it firstly constructs the Finite-State Autom...
Philosophical Worldview and Pedagogical Perspectives of the Poets-Zhyrau of the Aral Sea and Syr Darya Areas
Each nation has its own specific features, national character, moral norms, customs, manners, traditions, and lifestyle. Each nation has its own culture that has developed over hundreds of years and certainly affects the people’s way of life and educational process. The desire to teach descendants progressive traditions and advanced morality, to cultivate their positive qualities is a sign of r...
Rule Based Morphological Analyzer of Kazakh Language
Having a morphological analyzer is a very critical issue especially for NLP related tasks on agglutinative languages. This paper presents a detailed computational analysis of Kazakh language which is an agglutinative language. With a detailed analysis of Kazakh language morphology, the formalization of rules over all morphotactics of Kazakh language is worked out and a rule-based morphological ...
Language(s)? Evidence from Kazakhstan’s Shift in State Language and Language of Instruction
This paper investigates the economic returns to language skills and bilingualism. The analysis is staged in Kazakhstan, a multi-ethnic country with complex ethnic settlement patterns that has switched its official state language from Russian to Kazakh. Using two newly assembled data sets, we find negative returns to speaking Kazakh and a negative effect of bilingualism on earnings while Russian...
Publication date: 2013